
add E2E testing framework#26

Merged
janisz merged 13 commits into main from e2e-tests
Feb 5, 2026

Conversation

@janisz
Contributor

@janisz janisz commented Jan 15, 2026

Description

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving an 87.5% pass rate (7 of 8 test cases) with GPT-5 models.

Validation

./scripts/run-tests.sh
══════════════════════════════════════════════════════════
  StackRox MCP E2E Testing with Gevals
══════════════════════════════════════════════════════════

Loading environment variables from .env...
Configuration:
  Agent Model: gpt-4o
  Judge Model: gpt-4o
  MCP Server: stackrox-mcp (via go run)

Running gevals tests...


=== Starting Evaluation ===

Task: list-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-workloads
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-nonexistent
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-scooby
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-maria
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-clusters-general
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-list
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: gevals-stackrox-mcp-e2e-out.json

=== Results Summary ===

Task: list-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/list-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-workloads
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-workloads.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-nonexistent
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-nonexistent.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-scooby
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-scooby.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-maria
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-maria.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-clusters-general
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-clusters-general.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-list
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-list.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

=== Overall Statistics ===
Total Tasks: 8
Tasks Passed: 8/8
Assertions Passed: 24/24

=== Statistics by Difficulty ===

easy:
  Tasks: 8/8
  Assertions: 24/24

══════════════════════════════════════════════════════════
  Tests Completed Successfully!
══════════════════════════════════════════════════════════
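
The totals in the summary can be re-derived from the per-task entries. The following is a minimal sketch that parses only the "Results Summary" text format shown above. It is not part of the test tooling, and the function and pattern names are made up for illustration. It assumes the input is the summary section alone, since the earlier "Starting Evaluation" section also prints `Task:` lines.

```python
import re

# Two entries copied from the "Results Summary" format in the log above.
SUMMARY = """\
Task: list-clusters
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-nonexistent
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)
"""

TASK_RE = re.compile(r"^Task: (?P<name>\S+)$")
STATUS_RE = re.compile(r"^\s*Task Status: (?P<status>\w+)$")
ASSERT_RE = re.compile(r"^\s*Assertions: \w+ \((?P<ok>\d+)/(?P<total>\d+)\)$")


def summarize(text):
    """Return (tasks_passed, tasks_total, assertions_passed, assertions_total)."""
    tasks_passed = tasks_total = asserts_ok = asserts_total = 0
    for line in text.splitlines():
        if TASK_RE.match(line):
            tasks_total += 1
        elif (m := STATUS_RE.match(line)):
            # Any status other than PASSED counts as not passed.
            tasks_passed += int(m["status"] == "PASSED")
        elif (m := ASSERT_RE.match(line)):
            asserts_ok += int(m["ok"])
            asserts_total += int(m["total"])
    return tasks_passed, tasks_total, asserts_ok, asserts_total


print(summarize(SUMMARY))
```

Fed the full eight-task summary, the same function should reproduce the 8/8 and 24/24 figures reported under "Overall Statistics".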

@codecov-commenter

codecov-commenter commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.36%. Comparing base (bc05b10) to head (a29703e).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #26   +/-   ##
=======================================
  Coverage   77.36%   77.36%           
=======================================
  Files          26       26           
  Lines        1162     1162           
=======================================
  Hits          899      899           
  Misses        223      223           
  Partials       40       40           
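
The percentage in the table can be checked from the raw counts. The sketch below assumes coverage is hits divided by all tracked lines (hits, misses, and partials) and that the result is truncated to two decimals rather than rounded. The truncation is an inference from the numbers: 899/1162 is about 77.3666%, and the report shows 77.36%. The function name is illustrative.

```python
def coverage_percent(hits, misses, partials):
    """Coverage as hits / (hits + misses + partials), truncated to two decimals."""
    lines = hits + misses + partials
    # Integer floor division truncates, so 77.3666... becomes 77.36, not 77.37.
    return (hits * 10_000 // lines) / 100


print(coverage_percent(899, 223, 40))
```

Hits, misses, and partials are unchanged between base and head, so the computed value is the same on both sides of the diff.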


@janisz janisz marked this pull request as draft January 19, 2026 17:39
janisz and others added 5 commits January 23, 2026 14:59
Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

# Conflicts:
#	internal/toolsets/config/tools.go
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
Fix E2E test assertion failures by improving tool descriptions with
smart usage pattern guidance. Tool descriptions now clearly indicate:

- When to call all three CVE tools for comprehensive coverage
  ("Is CVE-X detected in my clusters?" without specific cluster name)
- When to call only specific tools for targeted queries
  ("Is CVE-X detected in cluster staging-central-cluster?")

Changes:
- Update vulnerability tool descriptions (clusters, deployments, nodes)
  to use directive language and clear usage patterns
- Adjust cve-nonexistent test maxToolCalls from 2 to 3 to match
  comprehensive check pattern
- Update cve-cluster-does-not-exist verification to accept both
  "CVE not detected" and "cluster doesn't exist" responses

Results: All 24/24 E2E test assertions now pass (improved from 21/24).
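
The `maxToolCalls` change can be read as a simple budget check. This is a minimal sketch of that idea, not the framework's actual assertion logic. The tool names are hypothetical placeholders for the three vulnerability tools (clusters, deployments, nodes).

```python
def within_budget(tool_calls, max_tool_calls):
    """True when the agent made no more calls than the task allows."""
    return len(tool_calls) <= max_tool_calls


# Hypothetical names: a comprehensive check calls all three CVE tools once.
comprehensive_check = ["cve_clusters", "cve_deployments", "cve_nodes"]

print(within_budget(comprehensive_check, 2))  # limit before the change
print(within_budget(comprehensive_check, 3))  # limit after the change
```

Under this reading, a comprehensive three-tool check exceeds a limit of 2 and fits within a limit of 3, which is why the cve-nonexistent limit was raised.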

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…criptions

Changes:
- Switch E2E agent from GPT-4o to Claude Sonnet 4.5 via Vertex AI
- Add enableAllTools: true to MCP config for auto-approval
- Configure gpt-5-nano as LLM judge for cost efficiency
- Improve CVE tool descriptions with clear WHEN TO USE/WHEN NOT TO USE sections
- Update test assertions to account for Claude's comprehensive CVE checking behavior
- Update run-tests.sh to export Vertex AI environment variables

The tool descriptions now explicitly guide when to use each CVE detection tool:
- General "clusters" queries → comprehensive check (all 3 tools)
- Specific component queries → single relevant tool only
- Single cluster queries → orchestrator tool with cluster filter
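
The three rules above can be sketched as a small decision function. This is only an illustration of the guidance written into the tool descriptions. In the PR the agent makes this choice from the descriptions, not from code. The tool identifiers are hypothetical, and treating the "clusters" tool as the orchestrator tool, and giving a cluster name priority over a component, are assumptions.

```python
# Hypothetical identifiers; the commit message names categories, not tool names.
ALL_TOOLS = ("clusters", "deployments", "nodes")


def select_tools(component=None, cluster=None):
    """Return (tool, arguments) pairs to call for a CVE question."""
    if cluster is not None:
        # Single-cluster question: one call with a cluster filter.
        return [("clusters", {"cluster": cluster})]
    if component is not None:
        # Component-specific question: only the relevant tool.
        if component not in ALL_TOOLS:
            raise ValueError(f"unknown component: {component}")
        return [(component, {})]
    # General "is CVE-X in my clusters?" question: comprehensive check.
    return [(tool, {}) for tool in ALL_TOOLS]


print(select_tools())
print(select_tools(component="nodes"))
print(select_tools(cluster="staging-central-cluster"))
```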

All 8 E2E tests passing with 24/24 assertions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
@janisz janisz changed the title Improve LLM tool parameter guidance and add E2E testing framework add E2E testing framework Jan 23, 2026
- Update README.md with complete env var configuration
- Fix jq command examples (path and property names)
- Add AGENT_MODEL_NAME configuration to run-tests.sh
- Clarify cluster ID-only requirement in tool descriptions
- Add explanatory comments to eval.yaml about assertion fields
- Improve list-clusters verification text
- Remove leftover mcp-testing-framework.yaml file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@janisz janisz requested a review from mtodor January 29, 2026 15:35
@janisz janisz marked this pull request as ready for review January 29, 2026 15:36
Collaborator

@mtodor mtodor left a comment


Looks good! I have added a few questions and thoughts. Nothing crucial.

I didn't review the tasks because we will replace them in a follow-up.

Co-authored-by: Mladen Todorovic <mtodor@gmail.com>
janisz and others added 5 commits February 2, 2026 17:51
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
- Upgrade from gevals v0.0.1 to mcpchecker v0.0.4
- Move e2e-tests Go module to tools/ subdirectory to fix module resolution issue
  when running MCP server from mcpchecker directory
- Rename gevals/ directory to mcpchecker/
- Update build script: build-gevals.sh → build-mcpchecker.sh
- Update all references in documentation and scripts
- Fix jq commands in README for new mcpchecker JSON structure
- Remove gevals dependency from root go.mod
- Add Dependabot configuration to monitor both root and e2e-tests/tools modules

All tests passing (8/8 tasks, 24/24 assertions).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add smoke test script that validates e2e test configuration without
requiring actual agents or API keys. This allows CI to catch configuration
errors early.

Changes:
- Add e2e-tests/scripts/smoke-test.sh to validate:
  - mcpchecker binary builds
  - MCP server compiles
  - YAML configuration files are valid
  - Task files exist and are parseable
- Add .github/workflows/e2e-smoke-test.yml for CI integration
- Update README with smoke test section

The smoke test runs in <30s and requires no secrets, making it ideal
for PR validation.
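
One of the listed checks, that task files exist and are readable, could look roughly like the sketch below. The real `smoke-test.sh` is a shell script whose contents are not shown here, so this is an illustration of the idea only. It checks for presence and non-empty content and does not parse YAML. The function name and the demo file names are made up.

```python
import tempfile
from pathlib import Path


def check_task_files(tasks_dir):
    """Return a list of problems; an empty list means the check passed."""
    tasks_dir = Path(tasks_dir)
    if not tasks_dir.is_dir():
        return [f"missing directory: {tasks_dir}"]
    files = sorted(tasks_dir.glob("*.yaml"))
    if not files:
        return [f"no task files in {tasks_dir}"]
    # Flag files that contain nothing but whitespace.
    return [f"empty task file: {f.name}" for f in files if not f.read_text().strip()]


# Demo on a throwaway directory with one usable and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "list-clusters.yaml").write_text("placeholder: true\n")
    Path(tmp, "broken.yaml").write_text("")
    problems = check_task_files(tmp)

print(problems)
```

A check like this needs no API keys or running agents, which is the property the commit message highlights.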

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@janisz janisz requested a review from mtodor February 3, 2026 17:12
Collaborator

@mtodor mtodor left a comment


Nice work! 🏆

I added a few nitpicks; none are crucial, and any of them can be handled in a follow-up.

- Merge e2e-smoke-test.yml into test.yml to eliminate duplicate builds
- Simplify smoke-test.sh to only build and verify mcpchecker binary
- Remove MCP server build from smoke test (already built by test workflow)
- Remove YAML validation from smoke test (will use yamllint in separate PR)
- Add Makefile target for e2e-smoke-test
- Add go mod tidy verification using find for all Go modules
- Use find for dependency downloads to support multiple modules

This addresses PR review feedback and reduces CI build time by avoiding
duplicate checkout and build operations.
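
The multi-module handling relies on locating every `go.mod` in the repository. The workflow does this with `find`. The sketch below shows the equivalent discovery step in Python for illustration only and does not run `go mod tidy`. The function name is made up, and the demo layout mirrors the root and `e2e-tests/tools` modules mentioned earlier in the PR.

```python
import tempfile
from pathlib import Path


def find_go_modules(root):
    """Return directories under root (relative, POSIX style) that contain go.mod."""
    return sorted(
        p.parent.relative_to(root).as_posix() for p in Path(root).rglob("go.mod")
    )


# Demo layout: a root module plus the e2e-tests/tools module.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "go.mod").write_text("module example.com/root\n")
    tools = Path(repo, "e2e-tests", "tools")
    tools.mkdir(parents=True)
    Path(tools, "go.mod").write_text("module example.com/tools\n")
    modules = find_go_modules(repo)

print(modules)
```

Discovering modules this way means a newly added module is picked up without editing the workflow.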

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@janisz janisz merged commit cb19cfb into main Feb 5, 2026
4 checks passed
@janisz janisz deleted the e2e-tests branch February 5, 2026 15:31
3 participants